Lesson 6. Spatial Queries

Spatial analysis is a process that begins with exploring and mapping a dataset and can lead to potentially complex models and visualizations of real world features and phenomena. Spatial queries are the building blocks of this analytical process. These queries are software operations that allow us to ask questions of our data and which return data metrics, subsets or new data objects. In this lesson we explore the two basic types of spatial queries: measurement queries and relationship queries.


Instructor Notes


Types of Spatial Queries

The basic types of spatial queries are:

  • Spatial Measurement Queries
    • What is feature A’s length?
      • What is the length of the BART train line between Walnut Creek and Rockridge?
    • What is feature A’s area?
      • What is the area of Alameda County?
    • What is feature A’s distance from feature B?
      • What is the distance between Berkeley High School and Berkeley BART Station?
    • etc.
  • Spatial Relationship Queries
    • Is feature A within feature B?
      • What schools are in Berkeley?
    • Does feature A intersect with feature B?
      • *What in what cities is Tilden Regional Park located?
    • Does feature A cross feature B?
      • Does the BART line cross into Albany?
    • etc.
  • Combination Spatial Queries
    • What schools in Berkeley are within 1/4 mile of a BART station?

Both measurement and relationship queries operate on the geometry of features in one or in two datasets and are dependent on the type of geometry. For example, with point features you can make distance measurements or ask what points are spatially inside polygon objects. But it would not make sense to compute the area of a point. Polygon features, on the other hand, allow for a wider range of both measurement and spatial relationship queries.

There are important distinctions between these two types of queries. - Measurement queries always depend on the CRS of the data while spatial relationship queries almost always do not. - Measurement queries return a continuous value (e.g. area) while relationship queries evaluate to true or false, and then return the features for which the relationship is true.

Spatial Queries are Special

We already know how to do attribute queries with our data. For example, we can select one or more specific counties by name or select those counties where the total population is greater than 100,000 because we have these columns in the dataset.

Spatial queries are special because they are dynamic. For example, we can compute area from the geometry without it already being encoded or we can select BART stations in Berkeley even if city is not encoded in the BART data by linking those two spatial datasets in the same geographic space. This dynamic query capability is extremely powerful!

In this lesson we’ll work through examples of each of those types of queries.

6.0 Load and prep the data

Load the libraries we will use.

library(sf)
library(tmap)
library(here)

Read in the CA Counties data and then take a look at its geometry and attributes.

# Read in the counties shapefile
counties = st_read(dsn = here("notebook_data", 
                              "california_counties", 
                              "CaliforniaCounties.shp"))
## Reading layer `CaliforniaCounties' from data source 
##   `/Users/pattyf/Documents/Dlab/workshops/AY2022/R-Geospatial-Fundamentals/notebook_data/california_counties/CaliforniaCounties.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 58 features and 24 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -374445.4 ymin: -604500.7 xmax: 540038.5 ymax: 450022
## Projected CRS: NAD83 / California Albers
counties <- st_make_valid(counties)

plot(counties$geometry)

Take a look at the spatial dataframe.

head(counties,2)
## Simple feature collection with 2 features and 24 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -28402.76 ymin: -357924.6 xmax: 216465.7 ymax: -169685.2
## Projected CRS: NAD83 / California Albers
##    NAME STATE_NAME POP2010 POP10_SQMI POP2012 POP12_SQMI  WHITE BLACK AMERI_ES
## 1  Kern California  839631      102.9  851089   104.2829 499766 48921    12676
## 2 Kings California  152982      109.9  155039   111.4274  83027 11014     2562
##   ASIAN HAWN_PI HISPANIC  OTHER MULT_RACE  MALES FEMALES MED_AGE AVE_HH_SZ
## 1 34846    1252   413033 204314     37856 433108  406523    30.7      3.15
## 2  5620     271    77866  42996      7492  86344   66638    31.1      3.19
##   AVE_FAM_SZ HSE_UNITS VACANT OWNER_OCC RENTER_OCC CountyFIPS
## 1       3.61    284367  29757    152828     101782      06103
## 2       3.59     43867   2634     22329      18904      06089
##                         geometry
## 1 MULTIPOLYGON (((213672.6 -2...
## 2 MULTIPOLYGON (((12524.03 -1...

What is the CRS of the Counties data?

  • What are the units of that CRS?

Select just Alameda County and save it to a spatial dataframe

alameda = counties[counties$NAME=='Alameda',]
plot(alameda$geometry)

6.1 Measurement Queries

We’ll start off with some simple measurement queries.

We can get the area of Alameda County with thesf function st_area.

st_area(alameda)
## 1927093261 [m^2]

Okay! We got the area of the county in square meters.

sf uses the units package to manage (get and set) units.

It’s more useful to return the area of large regions in square KM (or sq miles) and we can do that conversion manually by dividing by 1,000,000 (1000 * 1000):

st_area(alameda) / 1000000
## 1927.093 [m^2]

BUT the units package doesn’t respond well to manual unit conversions and still reports the value as m2.

So let’s try that conversion with units.

units::set_units(st_area(alameda), km^2)
## 1927.093 [km^2]

Now you try it! Calculate the area of Alameda County in sq miles.

units::set_units(st_area(alameda), km^2)  ## WHAT SHOULD YOU CHANGE IT TO?
## 1927.093 [km^2]

Always check your measurements

It’s a good idea to check one or two measurements before you automate your workflow to make sure you are getting valid values. If we look up the area of Alameda county on wikipedia we get 739 sq mi (1,910 km2). Are the values returned by st_area valid? Why might they differ?

Calculate Area for the Counties

We can also use st_area to add the area of all counties to the spatial dataframe.

counties$areakm2 <- units::set_units(st_area(counties), km^2)

# take a look
head(counties)
## Simple feature collection with 6 features and 25 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -267387.9 ymin: -578158.6 xmax: 216677.6 ymax: 352693.6
## Projected CRS: NAD83 / California Albers
##          NAME STATE_NAME POP2010 POP10_SQMI POP2012  POP12_SQMI   WHITE  BLACK
## 1        Kern California  839631      102.9  851089  104.282870  499766  48921
## 2       Kings California  152982      109.9  155039  111.427421   83027  11014
## 3        Lake California   64665       48.6   65253   49.082334   52033   1232
## 4      Lassen California   34895        7.4   35039    7.422856   25532   2834
## 5 Los Angeles California 9818605     2402.3 9904341 2423.264150 4936599 856874
## 6      Madera California  150865       70.1  153025   71.065672   94456   5629
##   AMERI_ES   ASIAN HAWN_PI HISPANIC   OTHER MULT_RACE   MALES FEMALES MED_AGE
## 1    12676   34846    1252   413033  204314     37856  433108  406523    30.7
## 2     2562    5620     271    77866   42996      7492   86344   66638    31.1
## 3     2049     724     108    11088    5455      3064   32469   32196    45.0
## 4     1234     356     165     6117    3562      1212   22416   12479    37.0
## 5    72828 1346865   26094  4687889 2140632    438713 4839654 4978951    34.8
## 6     4136    2802     162    80992   37380      6300   72682   78183    33.1
##   AVE_HH_SZ AVE_FAM_SZ HSE_UNITS VACANT OWNER_OCC RENTER_OCC CountyFIPS
## 1      3.15       3.61    284367  29757    152828     101782      06103
## 2      3.19       3.59     43867   2634     22329      18904      06089
## 3      2.39       2.94     35492   8944     17472       9076      06106
## 4      2.50       2.98     12710   2652      6590       3468      06086
## 5      2.98       3.58   3445076 203872   1544749    1696455      06073
## 6      3.28       3.63     49140   5823     27726      15591      06102
##                         geometry          areakm2
## 1 MULTIPOLYGON (((213672.6 -2... 21137.940 [km^2]
## 2 MULTIPOLYGON (((12524.03 -1...  3603.706 [km^2]
## 3 MULTIPOLYGON (((-235734.3 1...  3443.291 [km^2]
## 4 MULTIPOLYGON (((12.28914 35... 12225.762 [km^2]
## 5 MULTIPOLYGON (((173874.5 -4... 10585.866 [km^2]
## 6 MULTIPOLYGON (((16681.16 -1...  5577.010 [km^2]

CRS and Spatial Measurements

Spatial measurements can differ greatly depending on the CRS. Let’s take a look.

# Calculate area using data in WGS84 CRS (4326)
counties$areakm2_wgs84 <- units::set_units(st_area(st_transform(counties,4326)), km^2)

# Calculate area using data in UTM NAD83 zone 10 CRS (26910)
counties$areakm2_utm <- units::set_units(st_area(st_transform(counties,26910)), km^2)

# Calculate area using data in Web Mercator CRS (3857)
counties$areakm2_web <- units::set_units(st_area(st_transform(counties, 3857)), km^2)

# Take a look at a subset of columns only
head(counties[,c('NAME','areakm2','areakm2_wgs84','areakm2_utm','areakm2_web')])
## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -267387.9 ymin: -578158.6 xmax: 216677.6 ymax: 352693.6
## Projected CRS: NAD83 / California Albers
##          NAME          areakm2    areakm2_wgs84      areakm2_utm
## 1        Kern 21137.940 [km^2] 21137.848 [km^2] 21201.508 [km^2]
## 2       Kings  3603.706 [km^2]  3603.104 [km^2]  3608.153 [km^2]
## 3        Lake  3443.291 [km^2]  3440.360 [km^2]  3440.595 [km^2]
## 4      Lassen 12225.762 [km^2] 12210.926 [km^2] 12228.764 [km^2]
## 5 Los Angeles 10585.866 [km^2] 10588.192 [km^2] 10628.068 [km^2]
## 6      Madera  5577.010 [km^2]  5574.650 [km^2]  5584.023 [km^2]
##        areakm2_web                       geometry
## 1 31840.623 [km^2] MULTIPOLYGON (((213672.6 -2...
## 2  5528.030 [km^2] MULTIPOLYGON (((12524.03 -1...
## 3  5725.400 [km^2] MULTIPOLYGON (((-235734.3 1...
## 4 21276.793 [km^2] MULTIPOLYGON (((12.28914 35...
## 5 15559.365 [km^2] MULTIPOLYGON (((173874.5 -4...
## 6  8810.596 [km^2] MULTIPOLYGON (((16681.16 -1...

Let’s discuss the output.

  • CA Albers: The CRS of our source data is an equal area projected CRS that is optimized for area measurements in CA. So these values in the area_km2 column are highly accurate if the underlying geometry is.

  • WGS84: When we computed the area based on the data transformed to WGS84 we got almost identical values. WGS84 is a geographic (unprojected) CRS with angular units expressed as decimal degrees. So why is the output accurate? Well, new versions of the sf package use spherical (rather than Euclidean) geometry calculations on geographic data to quickly and accurately compute area and distance. Pretty awesome!

  • UTM10: This is a CRS optimized for Northern CA areas. So calculations outside the zone, eg SoCal, will be increasingly distorted as the area moves farther from the center of UTM zone 10N.

  • Web Mercator: This is a CRS that preserves shape and distorts area - so not accurate at all for area calculations.

Check out the help documentation for ?st_area for more information. But the important takeaway is that you need to use a CRS that is appropriate for your analysis/mapping needs!

When creating a spatial analysis work flow it is common to start by transforming all of your data to the same, appropriate CRS.

Calculating Length with st_length

We can use the st_length operator in the same way to calculate the length features in a spatial dataframe. Always take note of the output units!

# Load BART Lines
bart_lines <- st_read(here('notebook_data', 'transportation', 'bart_lines_2019.geojson'))
## Reading layer `bart_lines_2019' from data source 
##   `/Users/pattyf/Documents/Dlab/workshops/AY2022/R-Geospatial-Fundamentals/notebook_data/transportation/bart_lines_2019.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 6 features and 7 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: -122.4711 ymin: 37.50197 xmax: -121.7726 ymax: 38.02384
## Geodetic CRS:  WGS 84
head(bart_lines)
## Simple feature collection with 6 features and 7 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: -122.4711 ymin: 37.50197 xmax: -121.7726 ymax: 38.02384
## Geodetic CRS:  WGS 84
##   objectid rec_id operator      mode_ class     meters Shape__Len
## 1     1385      3     BART Heavy Rail Rapid 127862.570  1.0567986
## 2     1386      3     BART Heavy Rail Rapid  83001.619  0.6562471
## 3     1387      3     BART Heavy Rail Rapid  88328.491  0.7153645
## 4     1388      3     BART Heavy Rail Rapid  74594.947  0.5934920
## 5     1389      3     BART Heavy Rail Rapid  79278.237  0.6695474
## 6     1390      3     BART Heavy Rail Rapid   6672.096  0.0503038
##                         geometry
## 1 MULTILINESTRING ((-122.3926...
## 2 MULTILINESTRING ((-121.9393...
## 3 MULTILINESTRING ((-121.9393...
## 4 MULTILINESTRING ((-122.3534...
## 5 MULTILINESTRING ((-121.9004...
## 6 MULTILINESTRING ((-122.2124...
bart_lines$len_mi <- units::set_units(st_length(bart_lines), mi)
bart_lines$len_km <- units::set_units(st_length(bart_lines), km)
head(bart_lines)
## Simple feature collection with 6 features and 9 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: -122.4711 ymin: 37.50197 xmax: -121.7726 ymax: 38.02384
## Geodetic CRS:  WGS 84
##   objectid rec_id operator      mode_ class     meters Shape__Len
## 1     1385      3     BART Heavy Rail Rapid 127862.570  1.0567986
## 2     1386      3     BART Heavy Rail Rapid  83001.619  0.6562471
## 3     1387      3     BART Heavy Rail Rapid  88328.491  0.7153645
## 4     1388      3     BART Heavy Rail Rapid  74594.947  0.5934920
## 5     1389      3     BART Heavy Rail Rapid  79278.237  0.6695474
## 6     1390      3     BART Heavy Rail Rapid   6672.096  0.0503038
##                         geometry         len_mi          len_km
## 1 MULTILINESTRING ((-122.3926... 62.628715 [mi] 100.791146 [km]
## 2 MULTILINESTRING ((-121.9393... 40.747035 [mi]  65.575997 [km]
## 3 MULTILINESTRING ((-121.9393... 43.377845 [mi]  69.809875 [km]
## 4 MULTILINESTRING ((-122.3534... 36.593701 [mi]  58.891853 [km]
## 5 MULTILINESTRING ((-121.9004... 38.910937 [mi]  62.621082 [km]
## 6 MULTILINESTRING ((-122.2124...  3.275202 [mi]   5.270927 [km]

Calculating Distance

The st_distance function can be used to find the distance between two geometries or two sets of geometries.

First let’s read in the bart station data.

bart_stations <- st_read(here('notebook_data', 'transportation','bart_stations_2019.geojson'))
## Reading layer `bart_stations_2019' from data source 
##   `/Users/pattyf/Documents/Dlab/workshops/AY2022/R-Geospatial-Fundamentals/notebook_data/transportation/bart_stations_2019.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 48 features and 11 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -122.4691 ymin: 37.50217 xmax: -121.7799 ymax: 38.01891
## Geodetic CRS:  WGS 84
head(bart_stations)
## Simple feature collection with 6 features and 11 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -122.4475 ymin: 37.72158 xmax: -122.2686 ymax: 37.8528
## Geodetic CRS:  WGS 84
##   objectid cpt_mode stoptype                          ts_locatio at_street
## 1     1299        T       TS BART - 12th St. Oakland City Center      <NA>
## 2     1300        T       TS             BART - 16th St. Mission      <NA>
## 3     1301        T       TS             BART - 19th St. Oakland      <NA>
## 4     1302        T       TS             BART - 24th St. Mission      <NA>
## 5     1303        T       TS                        BART - Ashby      <NA>
## 6     1304        T       TS                  BART - Balboa Park      <NA>
##   on_street sp_citynam agencyname routename                   station_na
## 1      <NA>       <NA>       BART      <NA> 12th St. Oakland City Center
## 2      <NA>       <NA>       BART      <NA>             16th St. Mission
## 3      <NA>       <NA>       BART      <NA>             19th St. Oakland
## 4      <NA>       <NA>       BART      <NA>             24th St. Mission
## 5      <NA>       <NA>       BART      <NA>                        Ashby
## 6      <NA>       <NA>       BART      <NA>                  Balboa Park
##         mode                   geometry
## 1 Rapid Rail POINT (-122.2715 37.80377)
## 2 Rapid Rail POINT (-122.4197 37.76506)
## 3 Rapid Rail POINT (-122.2686 37.80835)
## 4 Rapid Rail POINT (-122.4181 37.75247)
## 5 Rapid Rail  POINT (-122.2701 37.8528)
## 6 Rapid Rail POINT (-122.4475 37.72158)

Compute the distance between two BART stations…

st_distance(bart_stations[bart_stations$station_na=='Ashby',], 
                          bart_stations[bart_stations$station_na=='Downtown Berkeley',])
## Units: [m]
##          [,1]
## [1,] 1931.225

You can also use it to find the distance between multiple features.

st_distance(bart_stations[bart_stations$station_na=='Downtown Berkeley',], bart_stations)
## Units: [m]
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]    [,7]     [,8]
## [1,] 7381.988 17710.64 6866.866 18567.61 1931.225 22828.39 22928.5 26154.46
##          [,9]    [,10]    [,11]    [,12]    [,13] [,14]    [,15]    [,16]
## [1,] 16291.06 14381.92 26989.53 23925.42 25383.09     0 7457.117 37442.66
##         [,17]    [,18]    [,19]    [,20]    [,21]    [,22]    [,23]    [,24]
## [1,] 14208.26 43198.64 11275.03 21073.81 27384.11 12853.06 8129.942 4564.345
##         [,25] [,26]    [,27]    [,28]    [,29]    [,30]    [,31]    [,32]
## [1,] 31764.18 14725 1410.575 25979.78 18122.17 7459.668 19711.16 32801.25
##         [,33]    [,34]   [,35]    [,36]    [,37]    [,38]   [,39]    [,40]
## [1,] 4513.825 15540.83 10519.6 3184.935 18980.86 28933.05 30298.1 32109.34
##         [,41]    [,42]    [,43]    [,44]    [,45]    [,46]    [,47]    [,48]
## [1,] 27619.56 38105.04 50110.36 18039.68 35369.22 7631.367 45024.17 37016.97

6.2 Spatial Relationship Queries

Spatial relationship queries consider how two geometries or sets of geometries relate to one another in space. For example, you may want to know what schools are located within the City of Berkeley or what East Bay Regional Parks have land within Berkeley. You may also want to combine a measurement query with a spatial relationship query. Example, you may want to know the total length of freeways within the city of Berkeley.

Here is a list of some of the more commonly used sf spatial relationship operations.

  • st_intersects
  • st_within
  • st_contains (the inverse of st_within)
  • st_disjoint

These can be used to select features in one dataset based on their spatial relationship to another. In other works, you can use these operations to make spatial selections or create spatial subsets.

Enough talk. Let’s work through some examples.

What Alameda County Schools are in Berkeley?

First, load the CA Places data and select the city of Berkeley and save it to a sf dataframe.

places = st_read(here("notebook_data",
                      "census",
                      "Places",
                      "cb_2019_06_place_500k.shp"))
## Reading layer `cb_2019_06_place_500k' from data source 
##   `/Users/pattyf/Documents/Dlab/workshops/AY2022/R-Geospatial-Fundamentals/notebook_data/census/Places/cb_2019_06_place_500k.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1521 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.2694 ymin: 32.53416 xmax: -114.229 ymax: 41.99314
## Geodetic CRS:  NAD83
berkeley = places[places$NAME=='Berkeley',]

plot(berkeley$geometry)

Then, load the Alameda County schools data and make it a spatial dataframe.

schools_df = read.csv(here("notebook_data",
                           "alco_schools.csv"))

schools_sf = st_as_sf(schools_df, 
                      coords = c('X','Y'),
                      crs = 4326)

Check that the two sf dataframes have the same CRS.

st_crs(schools_sf) == st_crs(berkeley)
## [1] FALSE

They don’t have the same CRS so we need to align them. Let’s transform (or reproject) the CRS of both of these dataframes to UTM10N, NAD83 (EPSG 26910). This is a commonly used CRS for Northern CA data.

# Transform data CRSes
schools_utm10 <- st_transform(schools_sf, 26910)
berkeley_utm10 <- st_transform(berkeley, 26910)

If you look at the Schools data you will see that it has a City column. So we can subset the data by attribute to select the Schools in Berkeley. No need to do a spatial selection. But let’s do it anyway to demonstrate the process.

Assume that the schools data do not have that city column. How can we identify the schools in Berkeley spatially?

Here’s how!

# SPATIALLY select only the schools within Berkeley
berkeley_school = schools_utm10[berkeley_utm10, , op=st_intersects]  # NO QUOTES!!!

Yes that was it! Take a long look at that simple yet powerful spatial selection syntax.

You should interpret that syntax as:

  • Select the features (i.e. rows) in the schools_utm10 dataframe

  • whose geometry spatially intersects the Berkeley_utm10 geometry

  • and keep all of the schools_utm10 columns

    • all because the extraction brackets have no second argument
Important

The op=st_intersects argument is optional because st_intersects is the default spatial selector.

To emphasize this, let’s rerun the last command.

# SPATIALLY select only the schools within Berkeley
berkeley_schools = schools_utm10[berkeley_utm10, ]

What does spatially intersects mean?

Here’s one way to explain it.

Geometry A spatially intersects Geometry B if any of its parts (e.g., a point, line segment, or polygon) is equivalent to, touches, crosses, is contained by, contains, or overlaps any part of Geometry B.

So st_intersects is the most general of all spatial relationships! It is also the most useful. However, you can specify any of those more specific spatial relationships by setting op= to any of the options listed in the ?st_intersects? help documentation.

Let’s check out the sf object that our selection returned.

# How many schools did we get
dim(berkeley_schools)
## [1] 32  8
# Map the results
plot(berkeley_utm10$geometry)
plot(berkeley_schools$geometry, col="red", add = T)

IMPORTANT: The default spatial selection operator is st_intersects. If you want to use any other spatial operator - and it is rare that you need to - it must be specified.

For example, we can use the st_disjoint operator to select only those schools NOT in Berkeley.

# Select all Alameda County Schools NOT in Berkeley with the disjoint operator
berkeley_schools_disjoint = schools_utm10[berkeley_utm10, , op = st_disjoint]

# Plot the result
plot(berkeley_schools_disjoint$geometry)
plot(berkeley_utm10, 
     col = NA, 
     border = "red", 
     add = T)
## Warning in plot.sf(berkeley_utm10, col = NA, border = "red", add = T): ignoring
## all but the first attribute

There is no need to memorize these spatial operators (aka predicates)! Here is a fantastic sf cheatsheet that lists and briefly explains all these common functions (and many more).


Protected Areas in Alameda County

Let’s load a new dataset, the CA Protected Areas Database (CPAD), to demonstrate these spatial queries in more detail.

This dataset contains all of the protected areas (parks and the like) in California.

We will use this data and the Alameda County census tract data that we created earlier to ask, “What protected areas are within Alameda County?”

First load the CPAD data.

cpad <- st_read(here("notebook_data",
                     "protected_areas",
                     "CPAD_2020a_Units.shp"))
## Reading layer `CPAD_2020a_Units' from data source 
##   `/Users/pattyf/Documents/Dlab/workshops/AY2022/R-Geospatial-Fundamentals/notebook_data/protected_areas/CPAD_2020a_Units.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 17068 features and 21 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -374984.2 ymin: -604454.8 xmax: 540016.3 ymax: 449743.2
## Projected CRS: NAD83 / California Albers

What is the CRS of the CPAD data?

Let’s transform the CRS of the CPAD data to match that used by the alameda data (CA Albers).

st_crs(cpad) == st_crs(alameda)
## [1] TRUE

Let’s plot the data in so that we know what to expect. CPAD is big, so wait for it…

plot(alameda$geometry, col = 'grey', border = "grey")
plot(cpad$geometry, col = 'green', add = T)

We can see from our map that some of the protected areas are completely within Alameda County, some of them overlap, and some are completely outside of the county. To get both of the “inside” and “overlaps” cases we use the st_intersects spatial selection operator, which is the default. Let’s check it out.

cpad_intersects <- cpad[alameda,]  # st_intersects
cpad_within <- cpad[alameda, , op = st_within] # st_within

We can use tmap to explore the difference in the results from st_intersects vs st_within

tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(alameda) +
  tm_polygons(col = "gray", 
              border.col = "grey") +
tm_shape(cpad_intersects) +
  tm_borders(col = "green") +
tm_shape(cpad_within) +
  tm_borders(col = 'red')

Use the tmap layer control to toggle the intersects and within results and examine them more closely.

What you can see from the above, is that by default, st_intersects returns the features that intersect but it does not clip the features to the boundary of Alameda County. For that, we would need to use a different spatial operation - st_intersection.

Let’s try it!

cpad_in_ac <- st_intersection(cpad, alameda)
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries

Great! Now, if we scroll the resulting sf object we’ll see that the COUNTY column of our resulting subset gives us a good sanity check on our results. Or does it?

table(cpad_in_ac$COUNTY)
## 
##      Alameda Contra Costa  San Joaquin  Santa Clara 
##          645           12            1            3

Always check your output - both the attribute table & the geometry!

head(cpad_in_ac)
## Simple feature collection with 6 features and 45 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -198139.5 ymin: -48130.76 xmax: -162337.3 ymax: -22665.86
## Projected CRS: NAD83 / California Albers
##      ACCESS_TYP UNIT_ID                             UNIT_NAME SUID_NMA AGNCY_ID
## 64  Open Access     185                  Augustin Bernal Park     8732     1257
## 146 Open Access     366                      San Antonio Park    24832     1228
## 218 Open Access     586 Quarry Lakes Regional Recreation Area    30594     2032
## 394 Open Access    1438               Tennis & Community Park    26243     1257
## 409 Open Access   48353                     Sean Diamond Park    32917     1090
## 480 Open Access    1631                          Tillman Park    26365     1001
##                          AGNCY_NAME        AGNCY_LEV                 AGNCY_TYP
## 64              Pleasanton, City of             City               City Agency
## 146                Oakland, City of             City               City Agency
## 218 East Bay Regional Park District Special District Recreation/Parks District
## 394             Pleasanton, City of             City               City Agency
## 409                 Dublin, City of             City               City Agency
## 480                Alameda, City of             City               City Agency
##                                                 AGNCY_WEB            LAYER
## 64                     http://www.cityofpleasantonca.gov/             City
## 146 http://www2.oaklandnet.com/Government/o/opr/index.htm             City
## 218                               http://www.ebparks.org/ Special District
## 394                    http://www.cityofpleasantonca.gov/             City
## 409        http://www.ci.dublin.ca.us/index.aspx?nid=1458             City
## 480                       http://alamedaca.gov/recreation             City
##     MNG_AG_ID                      MNG_AGENCY       MNG_AG_LEV
## 64       1257             Pleasanton, City of             City
## 146      1228                Oakland, City of             City
## 218      2032 East Bay Regional Park District Special District
## 394      1257             Pleasanton, City of             City
## 409      1090                 Dublin, City of             City
## 480      1001                Alameda, City of             City
##                    MNG_AG_TYP
## 64                City Agency
## 146               City Agency
## 218 Recreation/Parks District
## 394               City Agency
## 409               City Agency
## 480               City Agency
##                                                                        PARK_URL
## 64  http://www.cityofpleasantonca.gov/services/recreation/rec-augustinpark.html
## 146                                                                        <NA>
## 218                                                                        <NA>
## 394                                                                        <NA>
## 409  https://www.dublin.ca.gov/Facilities/Facility/Details/Sean-Diamond-Park-33
## 480                                                                        <NA>
##      COUNTY   ACRES                  LABEL_NAME YR_EST                DES_TP
## 64  Alameda 217.388        Augustin Bernal Park      0            Local Park
## 146 Alameda  10.619            San Antonio Park      0            Local Park
## 218 Alameda 254.616 Quarry Lakes Reg. Rec. Area   2001 Local Recreation Area
## 394 Alameda  15.595     Tennis & Community Park      0            Local Park
## 409 Alameda   4.986           Sean Diamond Park   2018            Local Park
## 480 Alameda   3.586                Tillman Park      0            Local Park
##     GAP_STS    NAME STATE_NAME POP2010 POP10_SQMI POP2012 POP12_SQMI  WHITE
## 64        4 Alameda California 1510271     2029.8 1534551   2062.402 649122
## 146       4 Alameda California 1510271     2029.8 1534551   2062.402 649122
## 218       4 Alameda California 1510271     2029.8 1534551   2062.402 649122
## 394       4 Alameda California 1510271     2029.8 1534551   2062.402 649122
## 409       4 Alameda California 1510271     2029.8 1534551   2062.402 649122
## 480       4 Alameda California 1510271     2029.8 1534551   2062.402 649122
##      BLACK AMERI_ES  ASIAN HAWN_PI HISPANIC  OTHER MULT_RACE  MALES FEMALES
## 64  190451     9799 394560   12802   339889 162540     90997 740573  769698
## 146 190451     9799 394560   12802   339889 162540     90997 740573  769698
## 218 190451     9799 394560   12802   339889 162540     90997 740573  769698
## 394 190451     9799 394560   12802   339889 162540     90997 740573  769698
## 409 190451     9799 394560   12802   339889 162540     90997 740573  769698
## 480 190451     9799 394560   12802   339889 162540     90997 740573  769698
##     MED_AGE AVE_HH_SZ AVE_FAM_SZ HSE_UNITS VACANT OWNER_OCC RENTER_OCC
## 64     36.6       2.7        3.3    582549  37411    291242     253896
## 146    36.6       2.7        3.3    582549  37411    291242     253896
## 218    36.6       2.7        3.3    582549  37411    291242     253896
## 394    36.6       2.7        3.3    582549  37411    291242     253896
## 409    36.6       2.7        3.3    582549  37411    291242     253896
## 480    36.6       2.7        3.3    582549  37411    291242     253896
##     CountyFIPS                       geometry
## 64       06068 POLYGON ((-168734.7 -40681,...
## 146      06068 POLYGON ((-197188.2 -22828....
## 218      06068 MULTIPOLYGON (((-176646.1 -...
## 394      06068 POLYGON ((-167580.2 -36255....
## 409      06068 POLYGON ((-162499.7 -31451....
## 480      06068 POLYGON ((-198139.5 -28042....

Let’s also use an overlay plot to check the output geometry.

tm_shape(alameda) + 
  tm_polygons(col = 'gray', 
              border.col = 'gray') +
tm_shape(cpad_in_ac) + 
  tm_polygons(col = 'ACRES', 
              palette = 'YlGn',
              border.col = 'black', 
              lwd = 0.4, 
              alpha = 0.8,
              title =  'Protected areas in Alameda County, colored by area')

st_intersects or st_intersection?

It really depends! But make sure you understand the difference.

st_intersects is a logical operator that returns true if two geometries intersect in any way. When used to subset (or filter) a spatial dataframe, st_intersects returns those features in a dataframe that intersect with the filter dataframe.

On the other hand, st_intersection returns a new spatial dataframe that is the set intersection of the two dataframes, including both the geometries and attributes of the intersecting features. Use st_intersection with caution and always check your results.

Exercise: Spatial Relationship Query

It’s your turn.

Write a spatial relationship query to create a new dataset containing only the BART stations in Berkeley.

We already have these two datasets loaded as bart_stations and berkeley_utm.

Check the CRS of the two layers and plot these two datasets in an overlay map. If you need to transform the CRS of the bart_stations, make them match the CRS of berkeley_utm.

Then, write your own code to: 1. Spatially select the BART stations that are within Berkeley 2. Plot the Berkeley boundary and then overlay the selected BART stations.

# Check CRS of bart_stations
st_crs(bart_stations)
## Coordinate Reference System:
##   User input: WGS 84 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4326]]
#transform bart stations to make berkeley_utm10
bart_utm10 = st_transform(bart_stations, st_crs(berkeley_utm10))

Plot the data together

plot(bart_utm10$geometry)
plot(berkeley_utm10$geometry, border = 'blue', add=T)

Your code here!

Now, in the cell below, write the code to spatially select the Berkeley BART stations, then make the map.

# YOUR CODE HERE:

# Spatially select the BART stations within Berkeley

# Plot the Bart stations in Berkeley overlaid on top of the Berkeley City Boundary

Solution hidden here!

To see it, right-click and select “inspect element” in your browser (or look in the 06_Spatial_Queries.Rmd file, line 599).


6.3 Proximity Analysis

Now that we’ve seen the basic idea of spatial measurement and spatial relationship queries, let’s take a look at a common analysis that combines those concepts: promximity analysis.

Proximity analysis seeks to identify near features - or, in other words, all features in a focal feature set that are within some maximum distance of features in a reference feature set.

A very common workflow for this analysis is:

  1. Buffer around the features in the reference dataset to create buffer polygons.

  2. Run a spatial relationship query to find all focal features that intersect (or are within) the buffer polygons.


Let’s read in our bike boulevard data again.

Then we’ll find out which of our Berkeley schools are within a block’s distance (200 meters) of the bike boulevards.

bike_blvds <- st_read(here("notebook_data",
                           "transportation",
                           "BerkeleyBikeBlvds.geojson"))
## Reading layer `BerkeleyBikeBlvds' from data source 
##   `/Users/pattyf/Documents/Dlab/workshops/AY2022/R-Geospatial-Fundamentals/notebook_data/transportation/BerkeleyBikeBlvds.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 211 features and 10 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: 561541.2 ymin: 4189007 xmax: 566451.6 ymax: 4193483
## Projected CRS: WGS 84 / UTM zone 10N
plot(bike_blvds$geometry)

Of course, we need to reproject the boulevards to our projected CRS.

bike_blvds_utm10 = st_transform(bike_blvds, st_crs(berkeley_utm10))

Now we can create our 200 meter bike boulevard buffers.

bike_blvds_buf = st_buffer(bike_blvds_utm10, dist = 200)

Now let’s overlay everything.

tm_shape(berkeley_utm10) + 
  tm_polygons(col = 'lightgrey') + 
tm_shape(bike_blvds_buf) + 
  tm_polygons(col = 'pink', alpha = 0.5) +
tm_shape(bike_blvds_utm10) + 
  tm_lines() + 
tm_shape(berkeley_schools) + 
  tm_dots(col = 'purple', size = 0.2)

Great! Looks like we’re all ready to run our spatial relationship query to complete the proximity analysis. At this point (pun intended) select the schools that are in within the bike boulevard buffer polygons.

schools_near_blvds = berkeley_schools[bike_blvds_buf, ]

# or
#schools_near_blvds = berkeley_schools[bike_blvds_buf, , op = 'st_within']

Now let’s overlay again, to see if the schools we selected make sense.

tm_shape(berkeley_utm10) + 
  tm_borders() + 
  
# add the bike blvd buffer polygons  
tm_shape(bike_blvds_buf) + 
  tm_polygons(col = 'pink', alpha = 0.5) +

# add the bike blvd line features  
tm_shape(bike_blvds_utm10) + 
  tm_lines(lwd=2) + 

# add all berkeley schools  
tm_shape(berkeley_schools) + 
  tm_dots(col = 'purple', size = 0.2) +

# add schools near bike boulevards in yellow
tm_shape(schools_near_blvds) + 
  tm_dots(col = 'yellow', size = 0.2)

Leveling up!

You can use st_distance and its companion function st_nearest_feature to compute the distance between each feature and the nearest bike boulevard. The st_nearest_feature function returns the ID of the closest feature.

# Identify the nearest bike boulevard for each school
nearest_blvd = st_nearest_feature(berkeley_schools, bike_blvds_utm10)

# take a look!
nearest_blvd
##  [1]  71 171  39 136  42  30 163 172 127 171 189 156 137  33 187 197  50 207 169
## [20] 102 125 137   3 109 207  76 135 173  56 102 146  76

To repeat, the st_nearest_feature function returns the ID of the closest feature. These are stored in nearest_blvd.

Then we can calculate the distance between each school and it’s nearest bike boulevard.

berkeley_schools$bike_blvd_dist <- st_distance(berkeley_schools, 
                                              bike_blvds_utm10[nearest_blvd,], 
                                              by_element = TRUE)

#take alook
berkeley_schools[order(berkeley_schools$bike_blvd_dist),]
## Simple feature collection with 32 features and 8 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 561890.9 ymin: 4189109 xmax: 566342 ymax: 4194327
## Projected CRS: NAD83 / UTM zone 10N
## First 10 features:
##                                        Site                Address     City
## 407                          Berkeley Adult    1701 San Pablo Ave. Berkeley
## 27         Berkeley Arts Magnet at Whittier       2015 Virginia St Berkeley
## 437              Ecole Bilingue De Berkeley      1009 Heinz Avenue Berkeley
## 32                       Leconte Elementary    2241 Russell Street Berkeley
## 442                 School of the Madeleine     1225 Milvia Street Berkeley
## 33                     Malcolm X Elementary         1731 Prince St Berkeley
## 444                Walden Center and School   2446 McKinley Avenue Berkeley
## 445                       West-Wind Academy 1551 University Avenue Berkeley
## 432                            Academy, The   2722 Benvenue Avenue Berkeley
## 35  Rosa Parks Environmental Science Magnet        920 Allston Way Berkeley
##     State    Type API     Org                 geometry bike_blvd_dist
## 407    CA   Adult   0  Public POINT (562153.5 4192013)   1.288383 [m]
## 27     CA      ES 829  Public   POINT (564122 4192361)  13.848161 [m]
## 437    CA     K-8   0 private POINT (562512.4 4189869)  15.162313 [m]
## 32     CA      ES 748  Public POINT (564890.8 4190253)  15.332395 [m]
## 442    CA     K-8   0 private POINT (564013.4 4193297)  15.892361 [m]
## 33     CA      ES 834  Public POINT (563890.6 4189674)  27.250406 [m]
## 444    CA     K-6   0 private POINT (563898.5 4190992)  92.556921 [m]
## 445    CA 1 to 12   0 private POINT (563251.3 4191717)  93.426741 [m]
## 432    CA     K-8   0 private POINT (565543.3 4190686)  94.064527 [m]
## 35     CA      ES 735  Public POINT (562038.4 4191134) 107.902846 [m]

Exercise: Proximity Analysis

Now it’s your turn to try out a proximity analysis!

Write your own code to find all BERKELEY schools within walking distance (1 km) of a BART station.

As a reminder, let’s break this into steps:

  1. Spatially select the BART stations in Berkeley from the bart_utm10 dataframe
  2. Buffer your Berkeley BART stations to 1 km (HINT: remember your units!)
  3. Spatially select the schools within walking distance to the Berkeley BART stations.
  4. As always, plot your results for a good visual check!

To see the solution, look at the hidden text below.

Your code here

# YOUR CODE HERE:

# spatially select the Berkeley BART stations 
# # (You may have done this in a previous exercise.)

# buffer the BART stations to 1 km

# spatially select the schools within the buffers

# map your results with tmap
# # plot the Berkeley boundary (for reference) in lightgrey

# add the BART stations (for reference) to the plot in green

# add the BART buffers (for check) in lightgreen

# add all Berkeley schools (for reference) in black

# add the schools near BART (for check) in yellow

Solution hidden here!

To see it, right-click and select “inspect element” in your browser (or look in the 06_Spatial_Queries.Rmd file, line 770).

Bonus Exercise

Compute the distance between each Berkeley school and its nearest BART station!

# YOUR CODE HERE:

6.4 Recap

Leveraging what we’ve learned in our earlier lessons, we got to work with map overlays and start answering questions related to proximity. Key concepts include:

  • Measuring area and length
    • st_area,
    • st_length
    • st_distance
  • Spatial relationship queries
    • st_intersects,
    • st_intersection
    • st_within, etc.
  • Buffer analysis
    • st_buffer

 D-Lab @ University of California - Berkeley
 Team Geo